
Analyzing the JWST Pipeline products of HAT-P-1b observations with NIRISS/SOSS


Author: Néstor Espinoza (nespinoza@stsci.edu) | Latest update: August 23, 2020.

Table of contents

  1. Introduction
  2. Spectral extraction
    1. Tracing the orders
    2. Extracting the spectra
    3. Time-stamps and wavelength solution
  3. Fitting & analyzing white-light lightcurves
    1. Studying the residuals
  4. Fitting & analyzing the wavelength-dependent lightcurves
  5. Studying the transit spectrum of HAT-P-1b

1.-Introduction

This notebook is part of a series of notebooks that are being prepared by STScI in order to showcase how to simulate, process and analyze JWST observations for a wide range of science cases. Here, we touch on the transiting exoplanet observations science case and, in particular, on spectrophotometric observations of the primary transit of an exoplanet. During primary transit, the observed flux decrease due to the planet blocking light from the stellar host is proportional to $\delta = (R_p/R_*)^2$ --- a quantity known as the transit depth, where $R_p$ is the planetary radius and $R_*$ is the stellar radius. Interestingly, the transit depth is wavelength dependent; i.e., $\delta \equiv \delta (\lambda)$. This is because opacity sources in the planetary atmosphere absorb different amounts of light at different wavelengths and, therefore, the observed planetary radius $R_p$ --- and thus the occulted area during transit, $\delta$ --- is wavelength-dependent (see, e.g., Kreidberg 2018 for a review). This technique, referred to as transmission spectroscopy in what follows, aims at obtaining those transit depths as a function of wavelength from JWST observations through the study of ultra-precise transit lightcurves at different wavelengths.

Simulated data for NIRISS/SOSS observations targeting a transit of HAT-P-1b have been generated with the help of the awesimsoss simulator by the NIRISS team. These simulations, in turn, have been calibrated using the JWST pipeline. HAT-P-1b is a Guaranteed Time Observations (GTO) target of the NIRISS Exploration of the Atmospheric diversity of Transiting exoplanets (NEAT) program, which will be observed using the NIRISS/SOSS instrument onboard JWST. This target is also used as an example science program used in the JWST documentation. If you are not familiar with transiting exoplanet observations and/or with how JWST will observe planetary transits, we encourage you to read through that example science program in order to familiarize yourself with the terminology.

In this notebook, we analyze the 2D spectral products of the JWST pipeline in order to obtain the transmission spectrum of this exoplanet. In particular, the data we take a look at here have been processed by the JWST pipeline up to the assign_wcs step of the pipeline's Stage 2 processing. The simulations didn't include flat-fielding, so we don't apply the flat-field step of the pipeline to them. Although the JWST Pipeline will eventually be able to produce white-light lightcurves, as well as extracted 1D spectra, as of the writing of this notebook the JWST calibration pipeline does not have a spectral extraction algorithm that can deal with the complicated structure of NIRISS/SOSS data. In particular, the SUBSTRIP256 subarray has data from at least two NIRISS/SOSS orders, which overlap at the reddest wavelengths. An algorithm to properly extract these data is in the making, but in this notebook we will perform our own tracing and extraction of the spectra in order to showcase how to analyze the JWST pipeline products. After extracting the spectra, we will generate the white-light lightcurves of each order, fit them and compare the extracted parameters to the ones in the literature. Then, we will repeat the procedure for the wavelength-dependent lightcurves, with which we will obtain the transmission spectrum of the exoplanet.

Before we begin, let's import some libraries:

2.-Spectral extraction

A.-Tracing the orders

Before we can go ahead and perform spectral extraction of the orders, we need to trace them in order to guide the spectral extraction algorithm on where the spectrum is. Let's first do some data exploration to understand how we might do this with the complicated spectral profile of NIRISS/SOSS.

The products we will be working with are the so-called "rates". These combine the information of all the groups in a given integration into one number, which is the slope of the up-the-ramp samples in that integration (so they have units of counts per second). These calibrated data have already been corrected for various inhomogeneities present in the raw JWST data, such as bias and dark current. Let's download them (or load them if they are already in the same folder as this notebook), along with some supplementary data we will be using for this notebook --- if downloading, be patient, as this might take a while (the calibrated data are 12 GB worth of data!):

Let's open the calibrated data, and explore its format and content:

The data we will mostly be interested in are in the SCI and ERR keys --- the former has the science data (the "rates") and the latter their errors. There is a lot of extra information in these pipeline products (e.g., the wavelength solution is here, as well as the time-stamps of the observations), but we will get to that in time. As can be seen, the SCI and ERR keys have three dimensions: the 2048 and 256 dimensions are the spatial ones (i.e., the 2D spectral dimensions). 1198, on the other hand, is the number of integrations of this particular simulated observation. This latter dimension is specific to time-series observations, as here we are interested in using the rates for each integration separately.

Let's extract the science data (SCI) and errors (ERR) then:

There are several ways of moving forward to trace the spectra. In theory, when JWST is on the sky, precise trace shapes and positions will be available, and we could use those. Here, however, we will trace each integration individually, in order to check for possible movements/shape changes of the trace during the observations. Before doing this, however, let's generate a "median" image by collapsing all the products in time, so as to have a view of the "average" shape of the traces along the entire exposure:
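As a minimal sketch of this collapse (assuming the rates are held in a numpy array with time along the first axis; the variable names here are illustrative, not the notebook's actual code):

```python
import numpy as np

def median_frame(cube):
    """Collapse a (time, rows, columns) cube into a single median
    image, ignoring any NaN pixels."""
    return np.nanmedian(cube, axis=0)

# Small synthetic stand-in for the (1198, 256, 2048) rates cube:
cube = np.ones((10, 256, 2048))
median_image = median_frame(cube)
```

Using `nanmedian` (rather than `median`) makes the collapse robust to pixels flagged as NaN in individual integrations.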

Let's plot it:

This plot tells us a lot about how to move forward with the data analysis of this dataset in general. For example, note that although the pipeline has done a good job at removing structure in the data, there is still some structure that appears as vertical strips. This is a well-known pattern due to the readout of IR detectors, typically referred to as "1/f noise". By eye, it seems some simple column-by-column background subtraction should take care of that. On top of this, the image tells us a bit about the care we must take with tracing. For example, the spectra of orders 1 and 2 start to overlap around pixel ~750. In addition, we can see that the right sides of both traces have the most signal, which decreases towards the left-hand side (redder wavelengths) --- so if we do any kind of iterative tracing, we should perhaps start on those sides of the traces. With tracing we also have to be careful with the sides of the image: both sides have 4 reference pixels each (i.e., pixels 0 to 3 and 2044 to 2047) which are not sensitive to light, so we have to be careful to leave those out of our analyses. Finally, it is important to note that the order at the bottom of this image (order 2) ends rather abruptly at around column 1750; this is actually an artifact of the simulations, and not a real feature of the NIRISS/SOSS traces.

Let's use all of the above to our advantage to trace the spectra of each integration. First, let's write a small script to get the centroids of each column for each order in each integration. Here we'll use a very simple algorithm in which we convolve each column with a Gaussian filter (in order to smooth the profile, which is rather complex), and then compute the centroid of that filtered profile. We repeat this procedure for each of the columns of the image in order to obtain the centroids of the traces, and then we fit them with a polynomial:
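A rough, numpy-only sketch of the per-column step (smooth, then take the flux-weighted centroid); the function name and kernel width are illustrative choices, not the notebook's exact implementation:

```python
import numpy as np

def column_centroid(profile, sigma=5.0):
    """Smooth a 1D cross-dispersion profile with a Gaussian kernel,
    then return its flux-weighted centroid (in pixels)."""
    half_width = 3 * int(sigma)
    x = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(profile, kernel, mode='same')
    rows = np.arange(len(profile))
    return np.sum(rows * smoothed) / np.sum(smoothed)

# Synthetic column: a Gaussian-like profile centered on row 70
rows = np.arange(256)
profile = np.exp(-0.5 * ((rows - 70.0) / 10.0) ** 2)
center = column_centroid(profile)  # ~70
```

In practice one would loop this over all 2048 columns (skipping the reference pixels) and all integrations, seeding each column's search with the previous column's centroid.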

Let's try our algorithm first for order 1 (the one on the top). Let's start it on column 2043 (as columns larger than that have reference pixels, so no signal) which, by eye, has the center of the profile around pixel 70 in the median image. Let's do it for all integrations, and plot all of them on top of the median image to visually see our results:

Woah, pretty good for such a simple approach! There are evidently some parts of the traces that go a bit astray. In particular, the trace goes off for values below around 30. Let's fit a Chebyshev polynomial to each of them so we can smooth the shape a little bit, being careful not to use the pixels that throw the trace off in the fit. We could do some iterative fitting but, as we will see below, this is not terribly important. To select the best order for this polynomial, let's write a small function that does model selection using the Bayesian Information Criterion (BIC) for each trace, and then select the best order via "majority vote":
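A sketch of such a BIC-based order selection, under the Gaussian-likelihood form $\mathrm{BIC} = n\ln(\mathrm{RSS}/n) + k\ln n$ (the function name and test data are illustrative):

```python
import numpy as np
from numpy.polynomial import chebyshev

def best_cheby_order(x, y, max_order=10):
    """Return the Chebyshev polynomial order that minimizes the
    Bayesian Information Criterion for the (x, y) trace centroids."""
    n = len(x)
    bics = []
    for order in range(1, max_order + 1):
        coeffs = chebyshev.chebfit(x, y, order)
        residuals = y - chebyshev.chebval(x, coeffs)
        rss = np.sum(residuals ** 2)
        k = order + 1  # number of fitted coefficients
        bics.append(n * np.log(rss / n) + k * np.log(n))
    return int(np.argmin(bics)) + 1

# A noisy quadratic "trace" should favor a low polynomial order:
rng = np.random.default_rng(42)
x = np.linspace(0.0, 2047.0, 500)
y = 70.0 + 1e-5 * (x - 1000.0) ** 2 + rng.normal(0.0, 0.1, x.size)
best = best_cheby_order(x, y)
```

The "majority vote" in the text would then simply take the mode of `best` over all integrations.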

That's pretty good!

Now, we can see that the same procedure is doomed to fail for order 2 (the lower trace) for columns below around 1000. This is because this order quickly starts to overlap with order 1 below this column. We will therefore help our algorithm a little bit by masking the upper trace, and only fitting the trace for columns above 1000. To this end, we will use the same algorithm for obtaining the centroids of the trace but will pass a masked image, containing only the lower trace. To do this, we will multiply all the values below row 80 (i.e., the upper pixels in the plot above) by zero.
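The masking step itself is a one-liner; a hedged sketch (the row cut of 80 comes from the text above, the function name is illustrative):

```python
import numpy as np

def mask_upper_trace(image, row_cut=80):
    """Zero out all rows below `row_cut` (the Order 1 trace region
    in this dataset) so that only the Order 2 trace remains."""
    masked = image.copy()
    masked[:row_cut, :] = 0.0
    return masked

frame = np.ones((256, 2048))
masked = mask_upper_trace(frame)
```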

We'll start our algorithm in column 1839 (the edge of the trace here), which has its centroid, by eye, around row 225. We also make sure to stop tracing for pixels smaller than 1000. Let's see how we do:

Pretty good! Let's smooth the trace, as done for Order 1, with a Chebyshev polynomial. Let's repeat the same procedure:

This looks great! Let's do a final plot showing the traces corresponding to the first integration on the median image:

That's pretty good for such a simple procedure!

How does the trace position change as a function of integration? Was our integration-by-integration tracing really worthwhile? Let's map this out using the location of the center of the trace as a proxy for the trace shift of each order; let's use the same (uncontaminated) column for both orders:

It appears that the impact of trace movement in these simulations is, indeed, very small. The pixel shifts of the traces are about 1/500 - 1/1000 of a pixel. Consequently, at least for the simulations analyzed here, we presume these will not have a big impact on the retrieved lightcurves and transmission spectrum.

B.-Extracting the spectra

With our traces at hand, in theory we can now perform simple extraction of the spectra in each integration. Before doing that, however, let's correct our images for the 1/f-noise patterns on a column-by-column basis, using the fact that we now know where the spectra are located, so we can mask the traces out of this procedure.

To this end, for each image we will mask all the pixels in a 30-pixel radius around the traces (more or less the radius of the actual portion of the trace that contains flux from the target), and use the remaining pixels to track the column-to-column variations. Let's apply these corrections to the data:
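A minimal sketch of this column-by-column correction (masking a 30-pixel radius around each trace and subtracting the median of the remaining pixels; names and the pure-python loop are illustrative, not the notebook's actual code):

```python
import numpy as np

def correct_one_over_f(image, trace_rows, radius=30):
    """Subtract, column by column, the median of the pixels lying
    more than `radius` rows from every trace centroid.
    `trace_rows` is a list of 1D arrays (one per order), each giving
    the trace row at every column."""
    corrected = image.copy()
    nrows, ncols = image.shape
    rows = np.arange(nrows)
    for col in range(ncols):
        mask = np.ones(nrows, dtype=bool)
        for trace in trace_rows:
            mask &= np.abs(rows - trace[col]) > radius
        if mask.any():
            corrected[:, col] -= np.median(image[mask, col])
    return corrected

# Column-dependent background stripes + a flat trace at row 100:
img = np.tile(np.arange(8.0), (256, 1))
trace = np.full(8, 100.0)
corrected = correct_one_over_f(img, [trace])
```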

All right, let's now see how we did --- to this end, let's check the corrections made on the first integration:

Much better! The strips observed in the original image have been corrected quite successfully thanks to our procedure.

Let's now move forward and write a small script that is able to do simple aperture extraction given a set of x and y coordinates that follow the trace, and loop it through all of our integrations. In theory, before doing this we would take care of bad pixels/cosmic rays, but we don't worry about this in this notebook because (a) cosmic rays have not been included in the simulations made to create this dataset and (b) bad pixels are the same in each image, so they don't impact any time-varying signal.

First, the aperture extraction function:
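A hedged sketch of what such a function can look like (a box aperture around the trace, with errors propagated in quadrature; the name and signature are illustrative):

```python
import numpy as np

def simple_extract(image, error, trace_rows, aperture=30):
    """Sum the flux in a +/- `aperture`-pixel window around the
    trace in each column; propagate the per-pixel errors in
    quadrature."""
    nrows, ncols = image.shape
    flux = np.zeros(ncols)
    flux_err = np.zeros(ncols)
    for col in range(ncols):
        center = int(round(trace_rows[col]))
        lo = max(center - aperture, 0)
        hi = min(center + aperture + 1, nrows)
        flux[col] = np.sum(image[lo:hi, col])
        flux_err[col] = np.sqrt(np.sum(error[lo:hi, col] ** 2))
    return flux, flux_err

# Flat synthetic frame, flat trace at row 100:
image = np.ones((256, 16))
error = np.ones((256, 16))
trace = np.full(16, 100.0)
flux, flux_err = simple_extract(image, error, trace)
```

Each column's window spans $2\times\mathrm{aperture}+1$ rows, so on a unit frame the extracted flux per column is 61 counts/s with error $\sqrt{61}$.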

Let us now extract the spectra. To this end, we have to define an aperture radius for the extraction --- we decide to use a 30-pixel aperture extraction, and a 50-pixel background radius. As shown in the following plot, these regions cover both the spectra and also a big chunk of the background region(s):

Now let's loop over all the integrations, extract the spectra of both orders, and save that to some dictionaries. We note that this will take a while (around 10 minutes). We will save both the spectra and the columns, so we can later relate the latter to wavelength-space:

Finally, let's plot the spectra of the first integration along with the errorbars:

This is pretty good! It is interesting to see how we can actually identify where the contamination from the second order kicks in in the Order 1 spectra (blue) around pixel ~750. This is actually evident from the images above, and something to keep in mind when performing analyses with this data.

Note: the flattening of the spectra for pixels above ~1700 in order 2 is a bug from the simulator.

C. Time-stamps and wavelength solution

Before continuing to the next step, we need to extract two extra data products that will become useful in our analyses in this notebook: (a) the time-stamps of each integration and (b) the wavelength solution corresponding to each pixel in the frame for Order 1 and Order 2.

The first is the easiest to extract --- these will be typically stored in the very same data products of the pipeline we are using here. In particular, recall that each integration is composed of various groups which sample up-the-ramp. These have been, in turn, combined in the rates that we see here. Perhaps the most useful timing measure for those ramps, thus, is the middle of an integration --- these can be extracted from the products as follows:
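In the real pipeline products these time-stamps ship alongside the data (e.g., in an INT_TIMES table); as a conceptual stand-in only, assuming back-to-back integrations of equal duration, the mid-integration times follow from simple arithmetic (all values below are illustrative, not this observation's actual timing):

```python
import numpy as np

def mid_integration_times(t_start, n_integrations, t_integration):
    """Mid-integration time stamps, assuming back-to-back
    integrations of duration `t_integration` (days) starting at
    `t_start` (days)."""
    i = np.arange(n_integrations)
    return t_start + (i + 0.5) * t_integration

# 1198 integrations of an assumed ~47-second duration:
times = mid_integration_times(0.0, 1198, 5.5e-4)
```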

As for the wavelength solution, there is a particular step in the JWST calibration pipeline that "attaches" this data to the products --- these have been attached to our dataset as well. To extract them, one has to re-open the pipeline products with a so-called datamodel from the JWST pipeline, that lets you gain access to this information:

Let's visualize this 2D wavelength solution for Order 1 and 2:

One important caveat to have in mind is that the wavelength maps are not strictly vertical, i.e., wavelengths do not align perfectly with the columns of the image for NIRISS/SOSS observations. An easy way to see this is to have an image showing "iso-wavelength" bands --- identify pixels that have the same wavelengths and "paint" them in a plot. Let's do this for Order 1:

As can be seen from these maps, this is not extremely critical for NIRISS/SOSS (iso-wavelength bands span ~3 pixels), but it might be important for precise, higher-resolution work to take this into account in the extraction. For our application here, however, we simply take the average wavelength value per column to associate wavelengths with pixels for each order, and we save that to our dictionary:
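The per-column averaging described above amounts to collapsing the 2D wavelength map along the rows; a minimal sketch (the tiled stand-in map below is illustrative):

```python
import numpy as np

def wavelength_per_column(wavemap):
    """Collapse a 2D wavelength map to one wavelength per column by
    averaging over rows, ignoring NaNs (unilluminated pixels)."""
    return np.nanmean(wavemap, axis=0)

# Stand-in for a (256, 2048) Order 1 map, redder to the left... here
# simply constant along each column:
wavemap = np.tile(np.linspace(2.8, 0.8, 2048), (256, 1))
wavelengths = wavelength_per_column(wavemap)
```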

Let's have a final look at the extracted spectra of the first 100 integrations, but now with these wavelengths as x-axis instead of the pixel columns:

That looks pretty good! Let's deep dive into analyzing the lightcurves that can be extracted from these spectra in the next sections.

3. Fitting & analyzing white-light lightcurves

Having extracted our spectra, we are ready to jump into the "fun" part of this notebook: the analysis of actual simulated NIRISS/SOSS transit lightcurves. Let's first create the white-light lightcurves of both Order 1 and Order 2 by summing the spectra extracted in the previous section, but only over regions where we know there will be no contamination from overlapping orders. This step is important because the white-light lightcurves help determine the most precise transit parameters for the wavelength-dependent lightcurves, and thus we want to have estimates that are as unbiased as possible.

From the figures above, it seems that for Order 1 pixels below around 300 and above 1300 have virtually no contamination from Order 2. For Order 2, fluxes from pixels above around 1300 also have virtually no contamination from Order 1. Let's generate the corresponding lightcurves taking that into account:
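A rough sketch of this summation for Order 1 (the column cuts come from the discussion above; array names are illustrative):

```python
import numpy as np

def white_light(spectra, good_columns):
    """Build a white-light lightcurve by summing each extracted
    spectrum over the uncontaminated columns, then normalizing by
    the median flux. `spectra` has shape (n_integrations, n_cols)."""
    summed = np.sum(spectra[:, good_columns], axis=1)
    return summed / np.median(summed)

# Order 1: keep columns below 300 and above 1300:
columns = np.arange(2048)
good = (columns < 300) | (columns > 1300)
spectra = np.ones((1198, 2048))  # flat stand-in for the extraction
lc = white_light(spectra, good)
```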

Wow! Those look pretty good. There is a notable difference in the overall lightcurve shape here. Part of it can be attributed to limb-darkening (Order 1 being the order covering the reddest wavelengths, limb-darkening produces a more "box-shaped" lightcurve for it than for Order 2) --- but a portion of it could also be explained by the transit depths themselves.

Let's now fit those transit lightcurves. We will fit them separately to see what we obtain from each order, and compare and discuss the results. To perform the transit fitting, users can of course use any tool they want. In this notebook, we will use juliet (http://juliet.readthedocs.io/), a tool that allows one to perform lightcurve fitting through Nested Sampling algorithms, so we can thoroughly explore the parameter space.

We first import juliet and write the priors for the parameters of our fit. We will leave the period of the orbit, $P$ (P_p1), fixed. The parameters we will be fitting for are: the time of transit center $t_0$ (t0_p1); two parameters, $r_1$ (r1_p1) and $r_2$ (r2_p1), that parametrize the planet-to-star radius ratio $R_p/R_*$ and the impact parameter of the orbit, $b = (a/R_*)\cos i$, by sampling them uniformly over the unit plane (see Espinoza (2018) for details); two parameters, $q_1$ (q1_SOSS) and $q_2$ (q2_SOSS), that parametrize the limb-darkening through a square-root limb-darkening law (see Kipping (2013)); the stellar density $\rho_*$ (rho); a scaling normalization factor for the out-of-transit flux (mflux_SOSS); and an unknown standard deviation added in quadrature to the errorbars of our data, sigma_w_SOSS. On top of this, we fix some parameters in our fit as well: we don't fit for the eccentricity $e$ (ecc_p1) or the argument of periastron passage $\omega$ (omega_p1), as a single transit does not hold much information on these parameters. Similarly, we fix any possible dilution of the transit lightcurve by background sources (mdilution_SOSS) to 1, so we assume no dilution is in place. We take very wide priors for all the parameters that are left free in our fit except for the period, which we assume is well-known:
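As an orientation to juliet's prior format, here is a sketch of what such a dictionary can look like --- each entry names a distribution and its hyperparameters. All numerical values below are illustrative placeholders, not the notebook's actual priors or HAT-P-1b's measured ephemerides:

```python
priors = {
    'P_p1':           {'distribution': 'fixed',      'hyperparameters': 4.465},
    't0_p1':          {'distribution': 'uniform',    'hyperparameters': [0.0, 0.3]},
    'r1_p1':          {'distribution': 'uniform',    'hyperparameters': [0.0, 1.0]},
    'r2_p1':          {'distribution': 'uniform',    'hyperparameters': [0.0, 1.0]},
    'q1_SOSS':        {'distribution': 'uniform',    'hyperparameters': [0.0, 1.0]},
    'q2_SOSS':        {'distribution': 'uniform',    'hyperparameters': [0.0, 1.0]},
    'rho':            {'distribution': 'loguniform', 'hyperparameters': [100.0, 10000.0]},
    'ecc_p1':         {'distribution': 'fixed',      'hyperparameters': 0.0},
    'omega_p1':       {'distribution': 'fixed',      'hyperparameters': 90.0},
    'mdilution_SOSS': {'distribution': 'fixed',      'hyperparameters': 1.0},
    'mflux_SOSS':     {'distribution': 'normal',     'hyperparameters': [0.0, 0.1]},
    'sigma_w_SOSS':   {'distribution': 'loguniform', 'hyperparameters': [0.1, 10000.0]},
}
```

This dictionary, together with the times, fluxes and errors, is what gets passed to juliet when loading the dataset for fitting.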

Having defined the parameters, priors and the hyperparameters of those priors, we now fit both datasets. For this, we pass the data in a juliet-friendly format, and fit each transit individually:

Let's see what the fits look like:

Those look pretty good! The precisions more or less match what is expected from NIRISS/SOSS white-light lightcurves. What about the posterior distribution of the parameters?

All right! It seems as if all the posteriors are more or less consistent with each other. However, there is a clear separation in the limb-darkening coefficients plane (despite the average values of the coefficients being consistent) --- this is expected; the profiles are expected to be different for both orders given the different wavelength ranges they encompass.

A. Studying the residuals

One might wonder if there is any structure in the residuals. Such structure can give rise to signals that, if not accounted for, might lead us to believe we have a precision that is much better than what the dataset actually has to offer. A classic approach to performing a quick check on the residuals is to see if, as you bin more datapoints, their rms decreases with the square root of the number of datapoints. If the data are distributed as Gaussian random noise, then one should see a $1/\sqrt{N}$ decline in this plot, where $N$ is the number of datapoints in a given bin. This is an easy check to make in this case:
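A minimal sketch of this rms-versus-bin-size check (illustrative names; white noise is used here to show the expected $1/\sqrt{N}$ behavior):

```python
import numpy as np

def rms_vs_binsize(residuals, max_bin=100):
    """RMS of the binned residuals as a function of bin size N.
    For white Gaussian noise this should fall off as 1/sqrt(N)."""
    sizes, rms = [], []
    for n in range(1, max_bin + 1):
        nbins = len(residuals) // n
        if nbins < 2:
            break
        binned = residuals[:nbins * n].reshape(nbins, n).mean(axis=1)
        sizes.append(n)
        rms.append(np.std(binned))
    return np.array(sizes), np.array(rms)

rng = np.random.default_rng(7)
sizes, rms = rms_vs_binsize(rng.normal(0.0, 1.0, 5000))
```

Plotting `rms` against `sizes` on a log-log scale, and overlaying the $\sigma/\sqrt{N}$ expectation, makes departures from white noise easy to spot.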

The plots look very promising, but it is hard to see if there is any particular frequency at which we should be paying attention to. What about the power spectrum of the residuals? This basically transforms the time-series to Fourier space where we can actually see if there is any evidence for residual signals at different time-scales/frequencies:
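For evenly-sampled residuals like these, a simple FFT-based periodogram suffices as a sketch of the idea (names illustrative; a Lomb-Scargle periodogram would be the choice for uneven sampling):

```python
import numpy as np

def power_spectrum(residuals, dt=1.0):
    """Periodogram of evenly-sampled residuals via the FFT: returns
    the frequencies and |FFT|^2 as an estimate of the power."""
    freqs = np.fft.rfftfreq(len(residuals), d=dt)
    power = np.abs(np.fft.rfft(residuals)) ** 2
    return freqs, power

# A pure sinusoid should produce a single dominant peak:
t = np.arange(1024)
freqs, power = power_spectrum(np.sin(2.0 * np.pi * 0.05 * t))
peak_freq = freqs[np.argmax(power[1:]) + 1]  # skip the DC term
```

White residuals would instead show power spread roughly evenly over all frequencies, with no dominant peak.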

This looks fairly good! There is apparently no clear peak in these power spectral densities, with power evenly distributed across all frequencies --- i.e., the noise pattern appears to be fairly "white". We thus move ahead with that hypothesis, and assume there is no residual signal and/or correlated noise left to model in our fits.

4. Fitting & analyzing the wavelength-dependent lightcurves

One of the main products that ought to be analyzed when observing transits with JWST are the wavelength-dependent lightcurves. Fitting those lets us obtain the final product of our observations: the planet's transmission spectrum. To retrieve this, let us fit the wavelength-dependent lightcurves for each order. Unlike with HST and ground-based low-resolution lightcurves, JWST's precision is so impressive in cases like this that we won't need to bin the lightcurves --- we can work at the (extracted) "pixel" level!

First, let's extract those lightcurves, and let's plot them in a slightly different way than above: let's create a matrix, where the rows are the wavelengths and the columns are the time-stamps. The values of this matrix will be the relative fluxes:

Nice! As can be seen, this plot comes in very handy. First of all, note how the transit appears as the dark region at all wavelengths where it is expected (i.e., where we saw it in the white-light lightcurves --- time of transit around 3.5 hours since the start of the observations). One can actually see how the lightcurves get noisier at longer wavelengths, which makes sense given the image plots (the traces fade to the left in all of our image/trace plots). On top of this, we can see that some lightcurves seem to be a bit noisier/corrupted than usual for Order 1 at the shorter wavelengths. We won't worry too much about those here, but this is a very important application of this kind of plot, which allows us to visually spot outlier lightcurves in one simple image.

Let's now fit those lightcurves. To this end, we once again use juliet --- however, we fix the ephemerides and orbital parameters ($P$, $t_0$, $a/R_*$ and $b$) to those found in the white-light analysis, as they should be wavelength-independent. For this, we combine the posterior parameters from Orders 1 and 2. In our fits, thus, we only leave the transit depth, the out-of-transit flux and the "jitter" as free parameters.

Let's first combine the orbital parameters using the posterior distributions of our white-light fits:

And define the priors of our fit using those:

And now we ingest those values into our priors, and fit the wavelength-dependent lightcurves. Note that performing the fitting below with NIRISS/SOSS' native resolution can take days (we need to run almost 3,000 juliet runs). Therefore, it is important to consider parallelizing these runs/fits in a real application.

For simplicity (and speed) in rendering this notebook, we have pickled and uploaded the results of these fits --- these were downloaded in the second cell of this notebook above. By default, thus, this notebook will read these downloaded results directly. If you wish to run the fits on your own, however, you can set the use_downloaded_results variable below to False:

If the fits are run, we create a matrix similar to the one above, but now for the residuals of the wavelength-dependent lightcurves. If results were downloaded, these have already been generated, so we just extract them from those products:

All right! That looks great. The residual image seems to be just random noise from the looks of it!

Let's now see how our transmission spectrum looks. The model transmission spectrum we are using is taken from the ExoCTK website (https://exoctk.stsci.edu/generic), with properties that match that of HAT-P-1b. We'll plot the original, non-binned spectrum and also plot the binned spectrum on top:

Interesting! There are clear hints of the Na and K features in the blue-optical bandpass of Order 2. The Order 1 transmission spectrum, on the other hand, shows a very clear detection of several water bands across the entire NIRISS/SOSS range. There seem to be some problems at around ~1.75-2 microns, where the contamination overlap between orders 1 and 2 is largest; this is expected given the simple extraction routine used above.

Overall, these results showcase the power of NIRISS/SOSS observations for targeting both optical and near-infrared features in the transit spectrum of HAT-P-1b.